Towards 100M Morphologically Annotated Corpus of Tajik

Authors

Gulshan Dovudov

Vit Suchomel

Pavel Smerk

Abstract

The paper presents work in progress: building a morphologically annotated corpus of the Tajik language with more than 100 million tokens. The corpus is, and will be, by far the largest available computer corpus of Tajik: even its current size is almost 85 million tokens. Because the available text sources are rather scarce, texts of lower quality also have to be included to reach this goal. This short paper briefly reviews the current state of the corpus and the analyzer, discusses the problems of either “normalizing” or at least categorizing low-quality texts, and finally outlines the prospects for the near future.

Similar resources

Named Entity Recognition for Linguistic Rapid Response in Low-Resource Languages: Sorani Kurdish and Tajik

This paper describes our construction of named-entity recognition (NER) systems in two Western Iranian languages, Sorani Kurdish and Tajik, as a part of a pilot study of Linguistic Rapid Response to potential emergency humanitarian relief situations. In the absence of large annotated corpora, parallel corpora, treebanks, bilingual lexica, etc., we found the following to be effective: exploiting...

Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora

We present a method for automatically learning inflectional classes and associated lemmas from morphologically annotated corpora. The method consists of a core language-independent algorithm, which can be optimized for specific languages. The method is demonstrated on Egyptian Arabic and German, two morphologically rich languages. Our best method for Egyptian Arabic provides an error reduction o...

Towards Morphologically Annotated Corpus of Hospital Discharge Reports in Polish

The paper discusses problems in annotating a corpus containing Polish clinical data with low-level linguistic information. We propose an approach to tokenization and automatic morphological annotation of data that uses existing programs combined with a set of domain-specific rules and vocabulary. Finally, we present the results of manual verification of the annotation for a subset of the data.

Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic

We present new language resources for Moroccan and Sanaani Yemeni Arabic. The resources include corpora for each dialect which have been morphologically annotated, and morphological analyzers for each dialect which are derived from these corpora. These are the first sets of resources for Moroccan and Yemeni Arabic. The resources will be made available to the public.

Building a 50M Corpus of Tajik Language

The paper presents by far the largest available computer corpus of the Tajik language, with a size of more than 50 million words. To obtain the texts for the corpus, two different approaches were used. The paper describes both of them, discusses their advantages and disadvantages, and shows some statistics of the two respective partial corpora. Then the paper characterizes the resulting join...

Publication date: 2012
